Abstract

A tabulation-based hash function maps a key into d derived characters
indexing random values in tables that are then combined with bitwise xor
operations to give the hash. Thorup and Zhang [8] presented d-wise
independent tabulation-based hash classes that use linear maps over finite
fields to map a key, considered as a vector (a, b), to derived characters.
We show that a variant where the derived characters are a + b · i for
i = 0, ..., d - 1 (using integer arithmetic) yields (2d - 1)-wise
independence. Our analysis is based on an algebraic property that
characterizes k-wise independence of tabulation-based hashing schemes, and
combines this characterization with a geometric argument. We also prove a
nontrivial lower bound on the number of derived characters necessary for
k-wise independence with our and related hash classes.



1 Introduction 

A family (multiset) $\mathcal{H}$ of hash functions $h : U \to R$ is called
k-wise independent if a hash function $h \in \mathcal{H}$ selected uniformly
at random maps any k distinct keys from U uniformly and independently to R.
Such classes of functions have found wide application in the literature. For
example, 4-wise independent hash functions can be used for estimating the
second moment of a data stream (see Thorup and Zhang [8]). Pagh, Pagh, and
Ruzic proved that insertions, queries, and deletions using hashing with
linear probing will all run in expected constant time if the hash function
used is 5-wise independent (assuming the number n of keys hashed is at most a
constant fraction of the table size). Seidel and Aragon [5] introduced the
treap data structure and showed that if the priorities of a set of keys are
8-wise independent random variables, then various performance guarantees hold
[6, Theorem 3.3].

For most applications, approximate k-wise independence, where the
probability that k keys map to k given values may deviate by a small amount
from the truly random case, suffices. The canonical approximate k-wise
independent hash family (where the range is $[m] := \{0, 1, \ldots, m-1\}$) is
constructed by choosing a prime p > m and taking as the family the set of all
mappings $x \mapsto P(x) \bmod m$, where P is a polynomial of degree k - 1
over $\mathbb{Z}_p$.
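
The polynomial family just described can be sketched in a few lines. This is a minimal illustration rather than an optimized implementation; the function names are ours, keys are assumed to lie in $\mathbb{Z}_p$, and the exhaustive check only covers the special case m = p, where the family is exactly (not just approximately) k-wise independent.

```python
import itertools
import random


def poly_hash(coeffs, x, p, m):
    """Evaluate P(x) over Z_p by Horner's rule, then reduce modulo m."""
    acc = 0
    for c in reversed(coeffs):  # coeffs[i] is the coefficient of x**i
        acc = (acc * x + c) % p
    return acc % m


def random_poly_hash(k, p, m, rng=random):
    """Draw one member of the family: k uniform coefficients in Z_p."""
    coeffs = [rng.randrange(p) for _ in range(k)]
    return lambda x: poly_hash(coeffs, x, p, m)


def is_exactly_k_wise(k, p):
    """Exhaustive check for m = p (no final reduction): every k distinct
    keys hit every k-tuple of values exactly once across the family."""
    family = list(itertools.product(range(p), repeat=k))
    for keys in itertools.combinations(range(p), k):
        counts = {}
        for coeffs in family:
            ys = tuple(poly_hash(coeffs, x, p, p) for x in keys)
            counts[ys] = counts.get(ys, 0) + 1
        if len(counts) != p ** k or set(counts.values()) != {1}:
            return False
    return True
```

A member of the family for, say, k = 4 and m = 1024 could be drawn with `random_poly_hash(4, 2**31 - 1, 1024)`; each evaluation then costs k - 1 multiplications modulo p, which is the cost that tabulation-based schemes try to avoid.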

Evaluating the polynomial is usually inefficient, especially for large values of 
k. Therefore, there has been interest in faster methods trading off arithmetic 
operations for a few lookups in tables filled with random values. 

1.1 Background on Tabulation-Based Hashing 

The idea behind tabulation-based hashing is splitting a key x into characters
(substrings) $x_0, \ldots, x_{q-1}$, which are then hashed by q hash functions
into a common range. The resulting values are then combined with exclusive or
operations to yield the final hash value, h(x). (For simplicity we assume for
now that the range of h is the set of all bit strings of some fixed length.)
Such a scheme was proposed by Carter and Wegman [1].

Since characters are shorter than the key, it can become feasible (and
desirable) to tabulate the hash functions on the characters; that is, getting
the hashed value of a character is just a lookup into a table position indexed
by that character. A function from such a hash class is therefore given by
$h(x_0 \cdots x_{q-1}) = \bigoplus_{i \in [q]} T_i(x_i)$, where $\oplus$
denotes the bitwise exclusive or operation, $[q] := \{0, \ldots, q-1\}$, and
$T_i$, for each $i \in [q]$, is a table filled with random values. Selecting h
uniformly at random selects the tables randomly; it is known that this scheme
is 3-wise independent if the tables are filled with 3-wise independent random
values, but irrespective of the tables' contents, 4-wise independence cannot
be achieved. (For any four 2-character keys (a, c), (a, d), (b, c), (b, d),
the hash value of one key is uniquely determined by that of the three other
keys: each table entry involved is used by exactly two of the four keys, so
the four hash values xor to zero.)
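
The dependency among the four "rectangle" keys can be seen directly in a small sketch of simple tabulation. The helper names, the 8-bit characters, and the 32-bit outputs are illustrative assumptions; Python's `random` module merely stands in for a source of random table entries.

```python
import random


def make_tables(q, char_bits, out_bits, rng):
    """One table of 2**char_bits random out_bits-bit values per character."""
    return [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
            for _ in range(q)]


def simple_tabulation(tables, chars):
    """h(x_0 ... x_{q-1}) = T_0[x_0] xor ... xor T_{q-1}[x_{q-1}]."""
    h = 0
    for table, ch in zip(tables, chars):
        h ^= table[ch]
    return h


rng = random.Random(1)
tables = make_tables(q=2, char_bits=8, out_bits=32, rng=rng)
a, b, c, d = 3, 200, 17, 99
rectangle = [(a, c), (a, d), (b, c), (b, d)]
xor_of_all = 0
for key in rectangle:
    xor_of_all ^= simple_tabulation(tables, key)
# Every table entry is used by exactly two of the four keys, so the xor
# vanishes regardless of the table contents.
```

Because `xor_of_all` is zero for every choice of tables, the fourth hash value is a function of the other three, which rules out 4-wise independence.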

A modification to the scheme that can achieve k-wise independence for
$k \ge 4$ is to derive a small number of characters
$D_0(x), D_1(x), \ldots, D_{d-1}(x)$ from a key x, and use these derived
characters in place of the components of the key for table lookups. That is,
a hash function would be given by $h(x) = \bigoplus_{i \in [d]} T_i(D_i(x))$,
where $T_0, \ldots, T_{d-1}$ are tables filled with random values.
Dietzfelbinger and Woelfel [3] suggested using random hash functions $D_i$
(from a c-universal hash family) in order to derive characters.

Thorup and Zhang [8] proposed an efficient deterministic way of computing
derived characters: Consider an arbitrary finite field F. Let
$x = (x_0, \ldots, x_{q-1})$ be a vector over F, and G a $q \times d$ matrix
(where $d \ge q$) such that every $q \times q$ submatrix has full rank over F.
Then $z = xG$ forms a vector $(D_0(x), \ldots, D_{d-1}(x))$ of d derived
characters. Thorup and Zhang proved that such a class is k-wise independent
if $d = (k-1)(q-1)+1$. In fact, it suffices that each of the tables
$T_0, \ldots, T_{d-1}$ is filled independently with k-wise independent values.
If the input characters are c-bit strings, then one can choose
$F = \mathrm{GF}(2^c)$ to obtain c-bit strings as derived characters. Thorup
and Zhang also gave a similar scheme for 4-wise independent hash classes
(which they later [9] proved was also 5-wise independent) that required only
2q - 1 derived characters for q input characters (as a tradeoff, some of the
derived characters were slightly larger than the input characters). A version
of the 5-wise independent scheme for q = 2 is
$h(ab) = T_0(a) \oplus T_1(b) \oplus T_2(a+b)$, where '+' is either regular
integer addition or addition modulo a suitable prime.

1.2 Contributions 

The efficiency of tabulation-based hashing is heavily influenced by the number
of table lookups, and thus by the number of derived characters. Therefore, we
study how many derived characters are needed to obtain k-wise independent
tabulation-based hash functions.

We suggest a variant of Thorup and Zhang's tabulation-based hash families,
called (q, d)-curve hash families. For q = 2, these hash families can achieve
(2d - 1)-wise independence using only d derived characters. Thus, for k > 6,
only about half as many table lookups in random tables, albeit slightly larger
ones, are needed as in Thorup and Zhang's construction in order to achieve
guaranteed k-wise independence. Another advantage of our construction is that
these hash functions can be computed with simple integer arithmetic (the i-th
derived character for a key x = (a, b) is simply a + i · b), as opposed to
finite field arithmetic. In fact, the only way Thorup and Zhang's scheme can
achieve practical performance is by using multiplication tables for the finite
field multiplications. In our scheme, multiplication tables are not necessary.
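
A (2, d)-curve hash function is short enough to sketch in full. The names below are hypothetical, and the tables are filled with values from Python's `random` module as a stand-in for the k-wise independent table contents the analysis assumes. The table sizes show why the tables are "slightly larger": with a, b in [n], the i-th derived character ranges up to (i + 1)(n - 1).

```python
import random


def derived_characters(a, b, d):
    """Derived characters a + i*b for i = 0, ..., d-1 (plain integer arithmetic)."""
    return [a + i * b for i in range(d)]


def make_curve_tables(n, d, out_bits, rng):
    """Table i must cover every value a + i*b with a, b in [n], i.e. the
    range 0 .. (i + 1) * (n - 1), so later tables are larger."""
    return [[rng.getrandbits(out_bits) for _ in range((i + 1) * (n - 1) + 1)]
            for i in range(d)]


def curve_hash(tables, a, b):
    """Xor of one lookup per derived character."""
    h = 0
    for i, table in enumerate(tables):
        h ^= table[a + i * b]
    return h
```

With d = 3 tables this sketch performs three lookups per key, the same number as the 5-wise scheme $T_0(a) \oplus T_1(b) \oplus T_2(a+b)$ but with derived characters a, a + b, a + 2b.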

Generally, a function from our (q, d)-curve hash family maps a key
$a_0 \cdots a_{q-1}$ to the y-values attained by the polynomial curve
$y = \sum_{i \in [q]} a_i z^i$ at $z = 0, \ldots, d-1$. These y-values form
the d derived characters that can be used for lookups in the random tables
$T_0, \ldots, T_{d-1}$. (As in other tabulation-based hash classes, we only
require that the tables $T_0, \ldots, T_{d-1}$ are filled independently with
k-wise independent random values.) Using integer arithmetic allows the
question of whether a (q, d)-curve hash family is k-wise independent to be
easily interpreted as a geometric problem regarding what intersections occur
in an arrangement of curves.

When q = 2, the polynomials are linear, and the problem is especially simple. 
We have the following result: 

Theorem 1. A (2, d)-curve hash family is (2d - 1)-wise independent.

We also establish a lower bound on the number of derived characters that
are needed for k-wise independence with a (2, d)-curve hash family.

Theorem 2. No (2, d)-curve hash family on $U = [n]^2$ is $2^d$-wise
independent, provided that $n \ge \max\{2^{d-1}(d-1)+2, 3\}$.

We prove these theorems through geometric arguments. At the heart of our
analysis is an algebraic characterization of the tabulation-based hash classes
which are k-wise independent. This characterization is of independent
interest. For example, it can be used to establish that every 2k-wise
independent tabulation-based hash class (based on fully random tables) is also
(2k + 1)-wise independent. This has the immediate consequence that for odd
k > 1, in Thorup and Zhang's construction (k - 2)(q - 1) + 1 derived
characters are sufficient for k-wise independence, which is q - 1 fewer
characters than what their proof guarantees. In addition, the algebraic
characterization simplifies the analysis of tabulation-based hash classes.

For q > 2, we have been able to achieve only a small reduction in the number
of required derived characters compared to Thorup and Zhang's hash families:
We show that in general, (q, d)-curve hash functions are k-wise independent if
$d \ge \left\lceil \frac{2(q-1)}{2q-1}(k-1) \right\rceil (q-1) + 1$, whereas
Thorup and Zhang's method requires $d \ge (k-1)(q-1)+1$. For values of k where
we don't have to round up, the number of lookups shrinks only by a constant
factor, and the saving diminishes as q grows (the factor is 4/5 for q = 3 and
6/7 for q = 4). Although we do not expect this improvement to be relevant in
practice (especially since using integer arithmetic means that larger lookup
tables are needed), we believe that the theoretical result indicates that
improved proof techniques may lead to further improvements in the future.
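
To make the comparison concrete, the two bounds can be tabulated. The sketch below assumes the (q, d)-curve bound reads $d \ge \lceil \frac{2(q-1)}{2q-1}(k-1) \rceil (q-1) + 1$, which is our reading of the formula in this section and of Theorem 4; the function names are ours.

```python
def curve_d(q, k):
    """d = ceil(2(q-1)(k-1) / (2q-1)) * (q-1) + 1, using an integer ceiling."""
    groups = -(-(2 * (q - 1) * (k - 1)) // (2 * q - 1))
    return groups * (q - 1) + 1


def thorup_zhang_d(q, k):
    """d = (k-1)(q-1) + 1, the number of derived characters in Thorup and
    Zhang's construction."""
    return (k - 1) * (q - 1) + 1
```

For q = 3 and k = 6 no rounding is needed and the counts are 9 versus 11 (4 groups of q - 1 columns instead of 5); for q = 3 and k = 5 the ceiling absorbs the gain and both functions return 9.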

We have run some initial experiments (see the appendix). These show
that (2, d)-curve hash families outperform (often by more than a factor of 2)
the functions from Thorup and Zhang's q = 2 class that are known to give at
least the same degree of independence. However, for q = 4, Thorup and Zhang's
functions are still more efficient than our (2, d)-curve functions, presumably
because the random tables $T_i$ are smaller and exhibit more cache-friendly
behaviour. Nevertheless, our experiments indicate that reducing the number of
table lookups can significantly increase the efficiency of tabulation-based
hashing, and that it is worthwhile trying to determine the exact independence
of such hash classes.



2 The Independence of Tabulation-Based Hash 
Classes 

The following definition of k-wise independent hash classes is standard:

Definition 1. A class $\mathcal{H}$ of hash functions of the form
$h : U \to [m]$ is called k-wise independent (where $k \in \mathbb{N}$) if
$\Pr(\forall i \in [k] : h(x_i) = y_i) = m^{-k}$ for all distinct
$x_0, \ldots, x_{k-1} \in U$ and all $y_0, \ldots, y_{k-1} \in [m]$, and h
selected uniformly at random from $\mathcal{H}$.

Throughout this paper we will assume that $m = 2^\ell$ for some
$\ell \in \mathbb{N}$ (note that we are denoting the positive integers by
$\mathbb{N}$; the non-negative integers will be called $\mathbb{N}_0$). It is
known and easy to see that every k-wise independent class is k'-wise
independent if $1 \le k' \le k$. A hash function is called k-wise independent
if it is selected uniformly at random from a k-wise independent class.

Now we will detail tabulation-based hashing and the terminology involved.
This description is based on the "general framework" described by Thorup and
Zhang [8, 9] and is also influenced by some notation used by Pătraşcu and
Thorup [5].

A derivation function D maps a key x to a sequence
$D(x) := (D_i(x) : i \in [d]) = \{(i, D_i(x)) : i \in [d]\}$, where each $D_i$
is some function. This sequence is called a derived key; the element $D_i(x)$
is called the i-th derived character. A tabulation-based hash function
$h : U \to [m]$, using derivation function D, is given by

$$h(x) = \bigoplus_{i \in [d]} T_i(D_i(x)),$$

where each $T_i$, $i \in [d]$, is some hash function into [m]. A
tabulation-based hash class $\mathcal{H}$ is a multiset of tabulation-based
hash functions where each member has (possibly) different functions $T_i$,
$i \in [d]$. We will use the notation $\mathcal{H}_D$ to denote the
tabulation-based hash class whose members use derivation function D.

The intention of a tabulation-based hash function is that the computation
of each $T_i$ is just one table lookup. The idea is that for each $i \in [d]$,
$D_i(U) = \{D_i(x) : x \in U\}$ is a subset of $[n_i]$ for some small $n_i$,
so that those tables are small enough to fit in fast memory.

Definition 2. A tabulation-based hash class $\mathcal{H}$ is called k-suitable
if each table $T_i$, $0 \le i < d$, is filled with k-wise independent random
values, and the choices for $T_0, \ldots, T_{d-1}$ are independent.

Some previous considerations of what degree of independence is achieved
by tabulation-based hash functions have made use of the following result:

Lemma 1 ([8, Lemma 2.1], see also [7, Lemma 2.6]). A k-suitable
tabulation-based hash function is k-wise independent if for each set of keys
S' of size $k' \le k$, there exists an $i \in [d]$ and $x \in S'$ such that
$D_i(x) \ne D_i(y)$ for any $y \in S' \setminus \{x\}$.

We will generalize this result to get a characterization of which k-suitable
tabulation-based hash functions are k-wise independent. First, we make one
more definition.

Definition 3. For any derivation function D and set
$S = \{x_0, \ldots, x_{k-1}\}$ of k keys, the derivation incidence matrix
$\mathcal{M}_D(S)$ is a (0, 1)-matrix having a column corresponding to each
element in $\bigcup_{j \in [k]} D(x_j)$, and having 1 in row j and column
(i, a) if and only if $D_i(x_j) = a$.

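To illustrate, if $S = \{x, y, z\}$ and x has derived key
$\{(0,4), (1,5), (2,6)\} = (4,5,6)$ and similarly $D(y) = (4,7,8)$,
$D(z) = (5,7,9)$, then $\mathcal{M}_D(\{x, y, z\})$ is

$$
\begin{array}{c|ccccccc}
  & (0,4) & (0,5) & (1,5) & (1,7) & (2,6) & (2,8) & (2,9) \\ \hline
x & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
y & 1 & 0 & 0 & 1 & 0 & 1 & 0 \\
z & 0 & 1 & 0 & 1 & 0 & 0 & 1
\end{array}
$$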



The derivation incidence matrix is unique up to reordering the rows and
columns.

We show that k keys from a set S are mapped uniformly and independently
by a random hash function from $\mathcal{H}_D$ if and only if the rows in
$\mathcal{M}_D(S)$ corresponding to those k keys are linearly independent. The
idea of the following theorem is very similar to a proof by Dietzfelbinger and
Rink [1, Proposition 1] where a hash function was shown to be fully random.

Theorem 3. Let $\mathcal{H}_D$ be a k-suitable tabulation-based hash class.
Then $\mathcal{H}_D$ is k-wise independent if and only if for every set S of k
keys, $\mathcal{M}_D(S)$ has full row rank over GF(2).

Proof. Let $S = \{x_0, \ldots, x_{k-1}\}$ be an arbitrary set of keys, and let
w be the number of columns in $M = \mathcal{M}_D(S)$.

First, suppose that M has full row rank over GF(2) (and so over
$\mathrm{GF}(2^\ell) = \mathrm{GF}(m)$). Pick h from a k-suitable
tabulation-based hash class $\mathcal{H}_D$ uniformly at random, and define a
vector $V = (v_0, \ldots, v_{w-1})^T \in [m]^w$ such that if the i-th column
of M is labelled (j, a), then $v_i = T_j(a)$. Then the matrix multiplication
$M \cdot V$ results in a $k \times 1$ column vector whose i-th entry is
$h(x_i)$.

Note that since there are k keys, there are at most k entries in V that
are based on the values in the table $T_j$ for each $j \in [d]$. Since the
tables represent independently chosen k-wise independent functions, this means
that V is distributed uniformly in $[m]^w$.

We consider the probability that $M \cdot V = Y$, for a fixed but arbitrary
$Y \in [m]^k$. Since M has full row rank there is at least one solution
$V_0 \in [m]^w$ to that equation, and the set of all solutions is
$\{V_0 + Z : Z \in \mathrm{Ker}(M)\}$ and so has dimension w - k by the
rank-nullity theorem; that is, there are $m^{w-k}$ solutions.

Therefore, since V is selected uniformly at random from $[m]^w$,

$$\Pr(M \cdot V = Y) = \frac{m^{w-k}}{m^w} = \frac{1}{m^k}.$$

So $\mathcal{H}_D$ is k-wise independent.

Now suppose that $\mathcal{M}_D(S)$ does not have full row rank. Then there is
a row, w.l.o.g. row k - 1, which is a linear combination of the others. Let
$\Psi$ be the set of the indices of the rows whose sum is equal to the last
row. It follows that for any $h \in \mathcal{H}_D$,
$h(x_{k-1}) = \bigoplus_{i \in \Psi} h(x_i)$. So for any
$y_0, \ldots, y_{k-2}$ and $y_{k-1} \ne \bigoplus_{i \in \Psi} y_i$,
$\Pr(h(x_i) = y_i \ \forall i \in [k]) = 0$. Therefore, $\mathcal{H}_D$ is not
k-wise independent. □

We can see that Theorem 3 is a generalization of Lemma 1. Suppose the
condition of Lemma 1 holds, that is, for each set S' of $k' \le k$ keys there
are $i \in [d]$, $x \in S'$ such that $D_i(x) \ne D_i(y)$ for
$y \in S' \setminus \{x\}$. Hence for each such S' with the corresponding
appropriate i and x, in $\mathcal{M}_D(S')$ the column labelled $(i, D_i(x))$
contains exactly one 1 (and so the rows of $\mathcal{M}_D(S')$ do not sum to
zero). Therefore, there is no subset of the rows of $\mathcal{M}_D(S)$ that
can sum to zero, i.e. $\mathcal{M}_D(S)$ has full row rank.
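
The rank condition of Theorem 3 is easy to test mechanically for a given set of derived keys. The following sketch (function names are ours) stores each row of the derivation incidence matrix as an integer bitmask and computes its rank over GF(2) by xor-based Gaussian elimination.

```python
def incidence_rows(derived_keys):
    """Rows of the derivation incidence matrix as Python ints (bitmasks).

    derived_keys[j] is the tuple (D_0(x_j), ..., D_{d-1}(x_j)).
    Returns (columns, rows): the sorted column labels (i, a) and one
    bitmask per key with bit c set iff the key has a 1 in column c.
    """
    columns = sorted({(i, a) for key in derived_keys for i, a in enumerate(key)})
    index = {label: c for c, label in enumerate(columns)}
    rows = []
    for key in derived_keys:
        mask = 0
        for i, a in enumerate(key):
            mask |= 1 << index[(i, a)]
        rows.append(mask)
    return columns, rows


def gf2_rank(rows):
    """Rank over GF(2) by Gaussian elimination on bitmasks."""
    basis = {}  # highest set bit -> reduced row with that leading bit
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]
    return len(basis)
```

Applied to the derived keys (4,5,6), (4,7,8), (5,7,9) of the illustration after Definition 3, the rank is 3, so those three keys are hashed independently. For the four rectangle keys of simple tabulation the rank is only 3, while adding the third derived character a + b restores full rank 4.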

2.1 Weaker independence 

Various properties of hash families weaker than k-wise independence have been
considered, such as the requirement that for all distinct
$x_0, \ldots, x_{k-1} \in U$ and all $y_0, \ldots, y_{k-1} \in [m]$, and h
selected uniformly at random from $\mathcal{H}$,

$$\Pr(\forall i \in [k] : h(x_i) = y_i) \le \frac{c}{m^k}.$$

We will call such a class (k, c)-wise independent (after [7]). It may be worth
noting that some of the lemmas we have just given can be adapted to describe
(k, c)-wise independent families.

If we modify the conditions of Theorem 3 to only require that the tables of a
hash function $h \in \mathcal{H}_D$ selected uniformly at random are
independent (k, c)-wise independent functions, then the probability of the
vector V defined in the proof of Theorem 3 assuming a particular value is
bounded above by $c^d / m^w$. Since $M \cdot V = Y$ has $m^{w-k}$ solutions
when M has full row rank, $\Pr(M \cdot V = Y) \le c^d / m^k$. Therefore
$\mathcal{H}_D$ is $(k, c^d)$-wise independent.

If $\mathcal{M}_D(S)$ does not have full row rank, then as in the latter part
of the proof of Theorem 3 let us suppose that the row corresponding to
$x_{k-1}$ is equal to the sum of the rows corresponding to the keys indexed by
$\Psi$. Then, for h selected uniformly at random from $\mathcal{H}_D$, at
least one of the $m^{k-1}$ possible (k - 1)-tuples $y_0, \ldots, y_{k-2}$ must
be equal to $(h(x_0), \ldots, h(x_{k-2}))$ with probability at least
$1/m^{k-1}$, so fixing such a tuple and setting
$y_{k-1} = \bigoplus_{i \in \Psi} y_i$ we get

$$\Pr(\forall i \in [k] : h(x_i) = y_i) \ge \frac{1}{m^{k-1}} = \frac{m}{m^k},$$

and hence $\mathcal{H}_D$ cannot be (k, c)-wise independent for c < m.

2.2 Even Degrees of Independence Imply Odd Degrees 

There is an interesting result that follows from the characterization of
k-wise independent hash classes showing that achieving k-wise independence for
odd k does not require more derived characters than achieving (k - 1)-wise
independence.

Lemma 2. If $\mathcal{H}_D$ is a 2k-wise independent tabulation-based hash
class and is (2k + 1)-suitable, then $\mathcal{H}_D$ is (2k + 1)-wise
independent.

Proof. Suppose for contradiction that $\mathcal{H}_D$ is not (2k + 1)-wise
independent even though it is (2k + 1)-suitable. Then by Theorem 3 there
exists a set of 2k + 1 keys S such that $\mathcal{M}_D(S)$ does not have full
row rank over GF(2); that is, some nontrivial linear combination of the rows
sums to zero. This is equivalent to saying that there is a nonempty subset
$\Psi$ of the rows that sum to zero.

Since $\mathcal{H}_D$ is 2k-wise independent, $|\Psi| > 2k$. Therefore, for
$\mathcal{H}_D$ not to be (2k + 1)-wise independent, the sum of all rows of
$\mathcal{M}_D(S)$ must be zero.

Let M be the submatrix of $\mathcal{M}_D(S)$ containing only those columns
corresponding to the first derived characters (i.e., those columns labeled
with (0, a) for some a). The entries in each column of M must sum to zero
(modulo 2), so the number of ones in each of these columns, and therefore in
M, must be even. However, there are exactly 2k + 1 ones (a single one for each
key) distributed among these columns, so this is impossible. □

This generalizes and simplifies the proof of the result by Pătraşcu and
Thorup [5] that any 4-wise independent tabulation-based hash class in which
all input characters are used as derived characters is 5-wise independent.
Thorup and Zhang's tabulation-based construction from [8] achieves k-wise
independence for $d = (k-1)(q-1)+1$. Lemma 2 implies that their construction
is k-wise independent for $d = (k-2)(q-1)+1$ if k > 1 is odd.

3 (q, d)-curve Hash Families

We will consider a set of hash classes that are variants of the k-wise
independent scheme of Thorup and Zhang [8].

Definition 4. For each $q \in \{2, 3, 4, \ldots\}$, $d \in \mathbb{N}$, a
(q, d)-curve family of hash functions is a tabulation-based hash family with
derivation function D given by

$$D_i(a_0 \cdots a_{q-1}) = \sum_{j \in [q]} a_j\, i^j$$

for $i \in [d]$ (with $0^0 = 1$, so that $D_0(a_0 \cdots a_{q-1}) = a_0$).

According to this definition, each key $a = a_0 \cdots a_{q-1}$ determines a
polynomial curve in the plane. The j-th derived character of a, $D_j(a)$, is
then the y-value of the curve at x-coordinate j. This motivates the following
definitions, which are intended to aid geometric reasoning.

Definition 5. For any key $a = a_0 \cdots a_{q-1}$, the corresponding key
curve $\ell_a : \mathbb{R} \to \mathbb{R}$ is defined by
$\ell_a(z) = \sum_{i \in [q]} a_i z^i$. Given a set S of keys, we will use
the notation $\ell(S)$ to denote the set $\{\ell_a : a \in S\}$.

Definition 6. A column is a set
$\{(c, y) : y \in \mathbb{Z}\} = \{c\} \times \mathbb{Z}$ for some
$c \in \mathbb{Z}$.

Definition 7. A bad column relative to a set of keys S is a column $\Psi$ such
that each point in $\Psi$ is intersected by an even (possibly zero) number of
elements of $\ell(S)$. That is, for all $(c, y) \in \Psi$ the cardinality of
$\{a \in S : \ell_a(c) = y\}$ is even.

Definition 8. A bad (q, d, k)-arrangement over a set U is a set of k key
curves (corresponding to keys in $S \subseteq U$) derived using a (q, d)-curve
hash family and having d consecutive bad columns
$\{0\} \times \mathbb{Z}, \{1\} \times \mathbb{Z}, \ldots, \{d-1\} \times \mathbb{Z}$.

For an arbitrary set S of k' keys, $\ell(S)$ is a bad (q, d, k')-arrangement
if and only if there is an even number of ones in each column of
$\mathcal{M}_D(S)$ (i.e., the rows of $\mathcal{M}_D(S)$ sum to zero over
GF(2), so $\mathcal{M}_D(S)$ does not have full row rank). Hence, by Theorem 3
a k-suitable (q, d)-curve family of functions mapping from a universe U is
k-wise independent if and only if for every $k' \in \{1, \ldots, k\}$, there
is no bad (q, d, k')-arrangement over U.
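
The geometric condition translates directly into a counting test. In this sketch (function names are ours) a key is a tuple of coefficients $(a_0, \ldots, a_{q-1})$, and a column is bad when every y-value attained in it is attained by an even number of key curves.

```python
from collections import Counter


def key_curve(key, z):
    """Value of the key curve sum_i a_i * z**i at integer z (Horner's rule)."""
    y = 0
    for coeff in reversed(key):
        y = y * z + coeff
    return y


def is_bad_column(keys, c):
    """True iff every point of column c lies on an even number of key curves."""
    hits = Counter(key_curve(key, c) for key in keys)
    return all(count % 2 == 0 for count in hits.values())


def is_bad_arrangement(keys, d):
    """True iff the keys are distinct, nonempty, and columns 0..d-1 are all bad."""
    return (len(keys) > 0 and len(set(keys)) == len(keys)
            and all(is_bad_column(keys, c) for c in range(d)))
```

For example, the four lines 0 + 1z, 0 + 2z, 1 + 0z, 1 + 1z pair up on columns 0 and 1 but take four different values on column 2, so they form a bad arrangement for d = 2 and not for d = 3.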

Let $k_{\max}(q, d)$ denote the largest integer k such that for any $k' \le k$
there is no bad (q, d, k')-arrangement over $(\mathbb{N}_0)^q$. By the
discussion above, for $k = k_{\max}(q, d)$, a (q, d)-curve family is k-wise
independent if it is k-suitable, but cannot be (k + 1)-wise independent if the
universe it acts on is large enough to include the set of keys corresponding
to a bad (q, d, k + 1)-arrangement. In the following we determine upper and
lower bounds on $k_{\max}(2, d)$.

3.1 A Lower Bound for $k_{\max}(2, d)$

Based on the geometric observations we have just made, to prove Theorem 1 it
suffices to prove the following lemma.

Lemma 3. For any $d \in \mathbb{N}$ and $k \in \{1, \ldots, 2d-1\}$, there
does not exist a bad (2, d, k)-arrangement over $(\mathbb{N}_0)^2$, i.e.,
$k_{\max}(2, d) \ge 2d - 1$.

Proof. Let $S = \{a_0 b_0, \ldots, a_{k-1} b_{k-1}\}$ be an arbitrary set of
$k \in \{1, \ldots, 2d-1\}$ keys in $(\mathbb{N}_0)^2$. If k is odd then the
statement is trivial, since a bad column is one that is intersected at every
point by an even number of curves, which is not possible when the total number
of curves intersecting the column is odd.

Now let k be even and w.l.o.g. say that $b_0 \ge b_i$ for all $i \in [k]$.
Suppose for contradiction that $\ell(S)$ is a bad (2, d, k)-arrangement.

For each $c \in [d]$, $\ell(S)$ can be partitioned into $\{A_c, E_c, B_c\}$,
the sets of curves that are, respectively, above, equal to, and below
$\ell_{a_0 b_0}$ in column c. More precisely,
$A_c := \{f \in \ell(S) : f(c) > a_0 + b_0 c\}$,
$E_c := \{f \in \ell(S) : f(c) = a_0 + b_0 c\}$, and
$B_c := \{f \in \ell(S) : f(c) < a_0 + b_0 c\}$.

Since $\ell(S)$ is a bad arrangement, each of these subsets must have an even
cardinality. Note that for each $c \in [d-1]$, if $f \in B_c$, then
$f \in B_{c+1}$, for since $\ell_{a_0 b_0}$ is a line with a slope ($b_0$) at
least as great as any other, no line that is below $\ell_{a_0 b_0}$ at some c
can ever rise above it later. So $|B_{c+1}| \ge |B_c|$ for all $c \in [d-1]$.

For each $c \in [d-1]$, $|E_c|$ is even so there exists
$f \in E_c \setminus \{\ell_{a_0 b_0}\}$. Since lines can only intersect once
and $\ell_{a_0 b_0}$ has slope at least as great as any other line,
$f \in B_{c+1}$. So $|B_{c+1}| > |B_c|$. Since the cardinality of $B_{c+1}$ is
also even, that means that $|B_{c+1}| \ge |B_c| + 2$, and since
$|B_0| \ge 0$, $|B_{d-1}| \ge 2(d-1) = 2d - 2$.

For this to be possible, we must require k > 2d - 2, since $\ell_{a_0 b_0}$
itself is not in $B_{d-1}$. But then k can only be 2d - 1, which is odd. □
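
For very small parameters the statement of Lemma 3 can be confirmed by exhaustive search. The sketch below is only a sanity check, not part of the proof; the function names and the small universe $[n] \times [n]$ are our own choices, and the search is exponential in k.

```python
from collections import Counter
from itertools import combinations


def is_bad(keys, d):
    """True iff every point in columns 0..d-1 lies on an even number of
    lines a + b*z, one line per key (a, b)."""
    for c in range(d):
        hits = Counter(a + b * c for a, b in keys)
        if any(count % 2 for count in hits.values()):
            return False
    return True


def smallest_bad_size(n, d, max_k):
    """Smallest k <= max_k admitting a bad (2, d, k)-arrangement over
    [n] x [n], or None if there is none."""
    universe = [(a, b) for a in range(n) for b in range(n)]
    for k in range(1, max_k + 1):
        for keys in combinations(universe, k):
            if is_bad(keys, d):
                return k
    return None
```

With n = 4 and d = 3 the search finds no bad arrangement of size at most 5 = 2d - 1, in line with the lemma, whereas for d = 2 the first bad arrangement appears at k = 4.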

3.2 An Upper Bound for $k_{\max}(2, d)$

We now consider the problem of determining how many derived characters are
needed for k-wise independence with a (q, d)-curve hash family. The following
result shows that, under certain conditions, if we want to double the degree
of independence we get from such a class, then the required value of d must
increase by at least q - 1.

Lemma 4. If there is a bad (q, d, k)-arrangement over $\mathbb{Z}$, then there
is a bad (q, d + q - 1, 2k)-arrangement over $\mathbb{Z}$.

Proof. Suppose that we have a bad (q, d, k)-arrangement $C = \ell(S)$. We will
use this in constructing a bad (q, d + q - 1, 2k)-arrangement.

Let Q(z) be a polynomial of degree q - 1 having zeros at each
$z \in \{d + i : i \in [q-1]\}$ and such that
$C' := \{P(z) + Q(z) : P(z) \in C\}$ is disjoint from C. Let us say that
$C' = \ell(S')$ for some set of keys S'.

Note that C' is, like C, bad on columns 0, ..., d - 1 since for any
$P_1, P_2 \in C$, $P_1(z) = P_2(z)$ if and only if
$P_1(z) + Q(z) = P_2(z) + Q(z)$. Then let
$C'' = C \cup C' = \ell(S \cup S')$. Thus C'' has 2k curves and, as we will
show, is a bad (q, d + q - 1, 2k)-arrangement.

Consider column z, for arbitrary $z \in [d]$. Since both C and C' are bad on
column z, any point (z, y) is on an even number c of curves from C and an even
number c' of curves from C'. Therefore the number of curves from C'' that pass
through (z, y) is c + c', which is also even.

Columns d through d + q - 2 are also bad, since for any
$z \in \{d, \ldots, d + q - 2\}$, P(z) = P(z) + Q(z) for all $P \in C$. Every
point in columns d through d + q - 2 is on an equal number of curves from C
and C'. □
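
For q = 2 the construction in this proof is a one-line transformation: Q must be linear with a zero at z = d, so Q(z) = t(d - z) for some nonzero integer t. The sketch below (names and the choice of t are ours) applies it to lines over the integers, where negative slopes are allowed.

```python
from collections import Counter


def is_bad(keys, d):
    """True iff every point in columns 0..d-1 lies on an even number of
    lines a + b*z, one line per key (a, b)."""
    for c in range(d):
        hits = Counter(a + b * c for a, b in keys)
        if any(count % 2 for count in hits.values()):
            return False
    return True


def double_arrangement(keys, d, t):
    """Lemma 4 for q = 2 with Q(z) = t * (d - z): each line a + b*z gains a
    shifted copy (a + t*d) + (b - t)*z that meets it exactly at z = d."""
    shifted = [(a + t * d, b - t) for a, b in keys]
    assert not set(keys) & set(shifted), "t must make the two sets disjoint"
    return list(keys) + shifted
```

Starting from the two lines 0 + 0z and 0 + 1z, which are bad on one column, one application with t = 1 gives four lines that are bad on two columns, and a second application with t = 2 gives eight lines that are bad on three.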

To apply Lemma 4 to describe the behavior of a hash family we need to know
that the bad (q, d + q - 1, 2k)-arrangement constructed in the proof actually
corresponds to keys in the universe of the hash functions in that class. The
following result, in showing sufficient conditions on the universe size in the
case of q = 2, implies Theorem 2.

Lemma 5. For each $d \in \mathbb{N}$ there is a set S of $k = 2^d$ keys in
$U = [n] \times [n]$ such that $\ell(S)$ is a bad $(2, d, 2^d)$-arrangement,
provided that $n \ge \max\{2^{d-1}(d-2)+2, 3\}$. In particular,
$k_{\max}(2, d) < 2^d$.

Proof. We will construct, for each $d \ge 3$, a bad $(2, d, 2^d)$-arrangement
where each input character is an element of $[2^{d-1}(d-2)+2]$ (for the
special cases of $d \in \{1, 2\}$ we will have arrangements where we require
the input characters to be in [3]). From this it follows that $2^d$-wise
independence cannot be achieved with d derived characters, assuming that the
universe is large enough to have characters of the appropriate size.

The proof is by induction on d. The base cases are d = 1, 2, 3. The
arrangements $\{0 + 0z, 0 + 1z\}$ and $\{0 + 1z, 0 + 2z, 1 + 0z, 1 + 1z\}$ of
$2^1$ and $2^2$ lines are bad on d = 1 and d = 2 columns respectively, and use
input characters from [3] only. For d = 3, the set of $2^3 = 8$ key curves
$\{0 + 3z, 0 + 4z, 1 + 2z, 1 + 3z, 4 + 1z, 4 + 2z, 5 + 0z, 5 + 1z\}$ is bad in
three columns and so is a bad $(2, 3, 2^3)$-arrangement. Note that all input
characters are in $[2^{d-1}(d-2)+2] = [2^2 + 2] = [6]$.

Suppose that there is a bad $(2, d, 2^d)$-arrangement C, for some $d \ge 3$,
derived from keys where all the input characters are in $[2^{d-1}(d-2)+2]$.
We now apply the construction in Lemma 4 with $Q(z) = 2^{d-1}(d - z)$ to get a
bad $(2, d+1, 2^{d+1})$-arrangement where the input characters are in
$[2^d(d-1)+2]$.

First, define $C' := \{(a + bz) + 2^{d-1}(d - z) : (a + bz) \in C\}$. We want
$C \cap C' = \emptyset$: for each $a + bz \in C$, $a \le 2^{d-1}(d-2) + 1$,
whereas for each $a' + b'z \in C'$, $a' \ge 2^{d-1} d > 2^{d-1}(d-2) + 1$.

Therefore, $C'' = C \cup C'$ is a set of $2^{d+1}$ key curves, derived from
keys where the first input characters are in
$[(2^{d-1}(d-2)+2) + 2^{d-1} d] = [2^d(d-1)+2]$ and the second input
characters are in $\{-2^{d-1}, \ldots, 2^{d-1}(d-2)+1\}$. By increasing the
value of all second input characters by $2^{d-1}$ we ensure that they are all
in $[2^{d-1}(d-1)+2] \subseteq [2^d(d-1)+2]$. This clearly does not change
which columns are bad. □

Remark 1. If we consider a hash class $\mathcal{H}$ which is a variant of a
(q, d)-curve hash family in that the derived characters are reduced modulo r
for any natural number r (not necessarily a prime), then any set S of keys for
which $\ell(S)$ would be a bad (q, d, k)-arrangement will not be hashed with
k-wise independence by h selected uniformly at random from $\mathcal{H}$.

Therefore, any lower bound on the number of derived characters needed for
k-wise independence with a (q, d)-curve hash family would also apply to
$\mathcal{H}$.
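
Returning to the inductive construction in the proof of Lemma 5, it can be replayed mechanically. The sketch below starts from the d = 3 base case and applies the step $Q(z) = 2^{d-1}(d - z)$ followed by the slope shift by $2^{d-1}$; the function names are ours, and the bound checked is the $2^{d-1}(d-2)+2$ of the lemma.

```python
from collections import Counter


def is_bad(keys, d):
    """True iff every point in columns 0..d-1 lies on an even number of
    lines a + b*z, one line per key (a, b)."""
    for c in range(d):
        hits = Counter(a + b * c for a, b in keys)
        if any(count % 2 for count in hits.values()):
            return False
    return True


# Base case d = 3 from the proof: eight lines a + b*z written as (a, b).
BASE_D3 = [(0, 3), (0, 4), (1, 2), (1, 3), (4, 1), (4, 2), (5, 0), (5, 1)]


def next_arrangement(keys, d):
    """One inductive step: from a bad (2, d, 2**d)-arrangement build a bad
    (2, d + 1, 2**(d + 1))-arrangement with non-negative characters."""
    t = 2 ** (d - 1)
    # Old lines, with slopes raised by t so that no slope ends up negative.
    old = [(a, b + t) for a, b in keys]
    # Copies shifted by Q(z) = t * (d - z), with slopes raised by t as well.
    new = [(a + t * d, b) for a, b in keys]
    return old + new


def character_bound(d):
    """Exclusive upper bound 2**(d-1) * (d-2) + 2 on the input characters."""
    return 2 ** (d - 1) * (d - 2) + 2
```

Iterating `next_arrangement` from `BASE_D3` yields, for each d, a set of $2^d$ distinct keys that is bad on d columns and whose characters stay below `character_bound(d)`, for instance below 18 when d = 4.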

3.3 A Lower Bound on $k_{\max}(q, d)$ for q > 2

We now consider how many derived characters are sufficient for k-wise
independence for (q, d)-curve hash families where q > 2.

Lemma 6. If C is a bad (q, q, k)-arrangement, there must be at least k/2
intersections between two curves on each column, and further to these there
must be at least $\lceil k/4 \rceil$ additional intersections that are each
either between columns or else on a column other than the last one.

Proof. Let C be a bad (q, q, k)-arrangement, and choose a partition of C into
pairs $(f_i, g_i)$, $i \in [k/2]$, such that $f_i(0) = g_i(0)$ (it is possible
to choose such intersecting pairs since the arrangement is bad). Furthermore,
since every column is bad, we can choose k/2 intersections for each column
such that each curve is in exactly one of those intersections.

We will associate with each pair $(f_i, g_i)$, $i \in [k/2]$, an additional
intersection (involving one or both of its members) that occurs before the
last column. So we will have k/2 such associations, and therefore at least
$\lceil (k/2)/2 \rceil = \lceil k/4 \rceil$ additional intersections (since an
intersection of two curves may involve curves from at most two associations).

Fix an arbitrary pair $(f_i, g_i)$. For each $z \in [q]$, let $A_z$, $E_z$,
and $B_z$ be the subsets of curves that are respectively greater than, equal
to, and less than $f_i$ in column z. Since the arrangement is bad, each of
these sets has even cardinality for each z (and since $f_i \in E_z$ for each
z, $|E_z| \ge 2$). If $|E_z| > 2$ for some $z \in [q-1]$, then there is some
$\alpha \in E_z \setminus \{f_i\}$ that we can pick to associate the
intersection of $\alpha$ and $f_i$ at z with the pair $(f_i, g_i)$.

Otherwise, $|E_z| = 2$ for all $z \in [q-1]$. By construction,
$E_0 = \{f_i, g_i\}$. Since $f_i, g_i$ are polynomials of degree q - 1 and so
can intersect at most q - 1 times, there must be some $z \in [q-1]$ such that
$g_i \in E_z$ but $g_i \notin E_{z+1}$; w.l.o.g. assume $g_i \in B_{z+1}$.
Then since $|B_{z+1}|$ is even, it must be that either there exists some
$\alpha \in A_z$ that is in $B_{z+1}$ or there exists some $\beta \in B_z$
that is in $E_{z+1} \cup A_{z+1}$. If there is such an $\alpha$, then it drops
below $f_i$ somewhere between columns z and z + 1, which means that there is
an intersection between columns that can be associated with $(f_i, g_i)$.
Similarly, if there is such a $\beta$, it would intersect $g_i$ somewhere
between columns z and z + 1. □

Theorem 4. A k-suitable (q, d)-curve family is k-wise independent if

$$d \ge \left\lceil \frac{2(q-1)}{2q-1}(k-1) \right\rceil (q-1) + 1.$$

Proof. Let
$d = \left\lceil \frac{2(q-1)}{2q-1}(k-1) \right\rceil (q-1) + 1$, and
suppose for contradiction that there is a bad (q, d, k')-arrangement
$\ell(S)$ for some $k' \le k$.

Partition the first d - 1 columns into consecutive groups of q - 1 columns.
Each column has at least k'/2 intersections on it, and each group has a
further $\lceil k'/4 \rceil$ intersections either on the columns, between
them, or between the last column of the group and the next column (this
follows from taking a copy of $\ell(S)$ translated so that the first column of
the group becomes column 0, and applying the last lemma to that arrangement).

Therefore, the total number of intersections in $\ell(S)$ is at least

$$\frac{k'}{2}\,d + \left\lceil \frac{k'}{4} \right\rceil \frac{d-1}{q-1}
  \ge \frac{k'}{4}\left(2d + \frac{d-1}{q-1}\right)
  > (q-1)\binom{k'}{2},$$

where the last inequality uses $(2q-1)\frac{d-1}{q-1} \ge 2(q-1)(k-1)$ and
$k' \le k$. This is impossible since there can be at most
$(q-1)\binom{k'}{2}$ intersections between members of a set of k' polynomials
of degree $\le q - 1$. □

Appendix



A Experiments 

We created implementations in C of (2, d)-curve hash classes (mapping 32-bit
keys to 32-bit hashes) for several small values of d, using tables filled with
pseudorandom numbers. We compared these against our implementations
of both the 2- and 4-character versions of Thorup and Zhang's scheme.
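
To make the structure of such a function concrete, the following is a minimal sketch and not the code used in the experiments. It assumes that the 32-bit key is split into two 16-bit characters a (high half) and b (low half), that the i-th derived character is a + b·i in integer arithmetic as described in the abstract, and that table i therefore needs 65535·(i + 1) + 1 entries. The names curve2_hash, curve2_new, curve2_eval, curve2_free and rand32 are hypothetical, and rand() is only a placeholder source of pseudorandom values.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container for one member of a (2,d)-curve hash class. */
typedef struct {
    int d;              /* number of derived characters */
    uint32_t **tables;  /* tables[i] is indexed by a + i*b */
} curve2_hash;

/* Placeholder 32-bit pseudorandom source built on rand(), which only
 * guarantees 15 random bits per call. */
static uint32_t rand32(void) {
    uint32_t lo  = (uint32_t)rand() & 0x7FFFu;
    uint32_t mid = (uint32_t)rand() & 0x7FFFu;
    uint32_t hi  = (uint32_t)rand() & 0x3u;
    return lo | (mid << 15) | (hi << 30);
}

static void curve2_free(curve2_hash *h) {
    if (!h) return;
    if (h->tables) {
        for (int i = 0; i < h->d; i++) free(h->tables[i]);
        free(h->tables);
    }
    free(h);
}

/* Choose a member of the class by filling d tables with random values.
 * Assumes d >= 1. Returns NULL on allocation failure. */
static curve2_hash *curve2_new(int d) {
    curve2_hash *h = malloc(sizeof *h);
    if (!h) return NULL;
    h->d = d;
    h->tables = calloc((size_t)d, sizeof *h->tables);
    if (!h->tables) { free(h); return NULL; }
    for (int i = 0; i < d; i++) {
        /* a + i*b ranges over 0 .. 65535*(i+1) for 16-bit a and b. */
        size_t n = (size_t)0xFFFFu * (size_t)(i + 1) + 1;
        h->tables[i] = malloc(n * sizeof **h->tables);
        if (!h->tables[i]) { curve2_free(h); return NULL; }
        for (size_t j = 0; j < n; j++) h->tables[i][j] = rand32();
    }
    return h;
}

/* Hash a key: xor together one table entry per derived character. */
static uint32_t curve2_eval(const curve2_hash *h, uint32_t key) {
    uint32_t a = key >> 16;      /* assumed: a is the high 16 bits */
    uint32_t b = key & 0xFFFFu;  /* assumed: b is the low 16 bits  */
    uint32_t out = 0;
    for (int i = 0; i < h->d; i++)
        out ^= h->tables[i][a + b * (uint32_t)i];
    return out;
}
```

Under these assumptions the tables grow linearly with i, so the total table size is quadratic in d; for d = 4 the sketch allocates roughly 2.6 MB.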

In implementing each version of Thorup and Zhang's scheme, we chose as
the matrix used to derive characters a Vandermonde matrix with a first
row consisting of 1's and a first column (excluding its intersection with the
first row) consisting of 0's (such a matrix was one of those suggested in [5]).
The first derived character was therefore in every case the first input character.
To perform the finite field multiplications required by their scheme, we used
precomputed multiplication tables. The 4-character implementation makes use
of the optimization of performing parallel additions of 8-bit values stored as
segments of larger integers, but the 2-character one does not, as this did not
seem to give an advantage.
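
As an illustration of what a precomputed multiplication table can look like, here is a minimal sketch for 8-bit characters. It is not taken from the implementations above: the field GF(2^8), the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B, the one used by AES), and the names gf256_mul_slow, gf256_mul_table and gf256_init_table are all assumptions made for the example.

```c
#include <stdint.h>

/* Assumed reduction polynomial: x^8 + x^4 + x^3 + x + 1. */
#define GF256_POLY 0x11Bu

/* gf256_mul_table[x][y] holds the product of x and y in GF(2^8). */
static uint8_t gf256_mul_table[256][256];

/* Shift-and-xor multiplication, used only to fill the table. */
static uint8_t gf256_mul_slow(uint8_t x, uint8_t y) {
    uint16_t acc = 0;
    uint16_t xx = x;  /* x times the current power of the indeterminate */
    for (int bit = 0; bit < 8; bit++) {
        if (y & (1u << bit)) acc ^= xx;       /* addition in GF(2^8) is xor */
        xx <<= 1;
        if (xx & 0x100u) xx ^= GF256_POLY;    /* reduce back below degree 8 */
    }
    return (uint8_t)acc;
}

/* Fill the 64 KB table once; afterwards a product is a single lookup. */
static void gf256_init_table(void) {
    for (int x = 0; x < 256; x++)
        for (int y = 0; y < 256; y++)
            gf256_mul_table[x][y] = gf256_mul_slow((uint8_t)x, (uint8_t)y);
}
```

After gf256_init_table() has run, a multiplication in the hashing loop reduces to reading gf256_mul_table[x][y]. For larger characters a full product table is impractical, and a different representation would be needed.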

For each hash class, the following test was run thirty times: a member of
the class was chosen (i.e. the tables were filled with (pseudo)random values),
an array of 1,000,000 (pseudo)random keys was generated, and the array was
iterated through 10 times, with each entry being hashed each time by the chosen
function. The mean time (in nanoseconds) per application of the function
was calculated by averaging the per-test average times over the thirty tests. The
sample standard deviation between tests was also recorded.
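
The measurement procedure can be sketched roughly as follows. This is an assumed reconstruction and not the original harness: the names hash_fn, identity_hash, time_one_test_ns, mean_and_sample_sd and sqrt_newton are hypothetical, the standard clock() function stands in for whatever timer was actually used, and the square root is computed by a small Newton iteration only to keep the sketch free of a math-library link dependency.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature for a hash function under test. */
typedef uint32_t (*hash_fn)(uint32_t key);

/* The identity function, used as a baseline. */
static uint32_t identity_hash(uint32_t key) { return key; }

/* Square root by Newton's method (stand-in for sqrt from <math.h>). */
static double sqrt_newton(double v) {
    if (v <= 0.0) return 0.0;
    double r = v > 1.0 ? v : 1.0;
    for (int i = 0; i < 60; i++) r = 0.5 * (r + v / r);
    return r;
}

/* One test: hash every key `passes` times and return the average time
 * per application in nanoseconds. Assumes n_keys > 0 and passes > 0. */
static double time_one_test_ns(hash_fn f, const uint32_t *keys,
                               size_t n_keys, int passes) {
    volatile uint32_t sink = 0;  /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < n_keys; i++)
            sink ^= f(keys[i]);
    clock_t end = clock();
    (void)sink;
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return seconds * 1e9 / ((double)n_keys * (double)passes);
}

/* Mean and sample standard deviation (n - 1 denominator) of the
 * per-test averages. */
static void mean_and_sample_sd(const double *x, size_t n,
                               double *mean, double *sd) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    *mean = (n > 0) ? sum / (double)n : 0.0;
    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        double delta = x[i] - *mean;
        ss += delta * delta;
    }
    *sd = (n > 1) ? sqrt_newton(ss / (double)(n - 1)) : 0.0;
}
```

To mirror the procedure described above, one would call time_one_test_ns thirty times with n_keys = 1000000 and passes = 10, choosing a fresh function and a fresh key array each time, and then pass the thirty results to mean_and_sample_sd.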

The following table shows these timing results. For integer d, curve2_d is a
function from a (2, d)-curve family; tz2_d is an implementation of Thorup and
Zhang's q = 2 scheme with d derived characters, and tz4_d is an implementation
of their q = 4 scheme with d derived characters. For comparison, we also tested
the identity function id.

Three computers were used for testing: "crunch", "laptop", and "desktop".
Specifications of the computers are as follows: "crunch" had two 3 GHz Intel
Xeon processors with 4 MB cache each, "laptop" had a 1.9 GHz AMD Athlon
TK-57 processor with 512 (2 x 256) KB cache, and "desktop" had a 3 GHz
Pentium 4 processor with 2 MB cache. Note that the code was compiled for
each machine separately, as a 32-bit executable on "desktop" and as 64-bit
executables on the others.

We see that in all tests (2, d)-curve functions outperformed (often by more
than a factor of 2) functions from Thorup and Zhang's q = 2 class that are
known to give as much independence.
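
The comparison can be checked directly against the mean times reported in Figure 1. The short sketch below copies those means for the curve2_d and tz2_d functions and counts, per machine, how often tz2_d is slower by more than a given factor. The type timing_pair and the function count_slower_by are hypothetical names introduced only for this example.

```c
#include <stddef.h>

/* Mean times (ns) from Figure 1 for one guaranteed k on one machine. */
typedef struct {
    int k;
    double curve2_ns;  /* curve2_d function */
    double tz2_ns;     /* tz2_d function with the same guaranteed k */
} timing_pair;

static const timing_pair crunch_times[7] = {
    {7, 13.67, 19.73},   {9, 29.53, 37.19},    {11, 49.37, 76.91},
    {13, 72.96, 132.85}, {15, 96.75, 187.16},  {17, 120.73, 241.09},
    {19, 143.09, 297.71}
};

static const timing_pair laptop_times[7] = {
    {7, 109.93, 233.40},  {9, 155.91, 328.88},  {11, 195.65, 432.71},
    {13, 239.33, 557.84}, {15, 291.42, 676.39}, {17, 339.61, 870.07},
    {19, 427.99, 1050.25}
};

static const timing_pair desktop_times[7] = {
    {7, 72.93, 104.85},   {9, 112.49, 184.06},  {11, 154.21, 290.92},
    {13, 202.28, 398.16}, {15, 257.54, 492.78}, {17, 310.56, 575.41},
    {19, 365.78, 671.39}
};

/* Count the rows in which tz2_d took more than `factor` times as long
 * as curve2_d. */
static int count_slower_by(const timing_pair *rows, size_t n, double factor) {
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (rows[i].tz2_ns > factor * rows[i].curve2_ns) count++;
    return count;
}
```

With these values, tz2_d is slower than curve2_d in all 21 comparisons, and slower by more than a factor of 2 in 8 of them: all seven on laptop and the k = 19 case on crunch.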

However, the tz4_d functions were often the best (always so on laptop and
desktop). On crunch the tz4_d functions did not seem to be sped up as
much as the others overall. A possible reason is that the tz4_d functions do not
benefit as much from a larger cache as the other functions because the derived
characters used by tz4_d are only 8 bits long (so the lookup tables are small).
guaranteed k  function   time on crunch       time on laptop        time on desktop
                         mean (ns)  SD        mean (ns)  SD         mean (ns)  SD

-             id         2.08       0.0029    2.28       0.6609     4.21       0.0027

7             curve2_4   13.67      0.4724    109.93     0.6524     72.93      8.7217
7             tz2_6      19.73      0.1768    233.40     57.0687    104.85     4.9674
7             tz4_16     22.71      0.0367    44.15      0.7651     56.62      0.0888

9             curve2_5   29.53      0.6552    155.91     0.8912     112.49     0.7536
9             tz2_8      37.19      2.0565    328.88     69.3904    184.06     1.9557
9             tz4_22     33.81      0.0464    60.47      0.9717     81.80      0.4511

11            curve2_6   49.37      0.3285    195.65     1.8137     154.21     0.8712
11            tz2_10     76.91      3.9032    432.71     64.0526    290.92     1.7706
11            tz4_28     45.38      0.0588    73.47      0.7742     103.63     1.8915

13            curve2_7   72.96      0.2663    239.33     2.8124     202.28     1.1247
13            tz2_12     132.85     2.7763    557.84     84.8628    398.16     1.4015
13            tz4_34     58.51      0.0703    106.74     37.9402    120.98     0.2935

15            curve2_8   96.75      0.3063    291.42     5.8155     257.54     1.0047
15            tz2_14     187.16     2.0494    676.39     154.8639   492.78     6.6196
15            tz4_40     69.07      0.0772    149.44     64.3039    151.02     0.1966

17            curve2_9   120.73     0.2728    339.61     4.9303     310.56     0.9222
17            tz2_16     241.09     0.9494    870.07     251.6200   575.41     1.4096
17            tz4_46     83.55      0.1112    192.05     61.9977    170.63     2.4714

19            curve2_10  143.09     0.4698    427.99     55.5387    365.78     1.2003
19            tz2_18     297.71     0.6359    1050.25    346.5042   671.39     2.0122
19            tz4_52     97.54      0.0715    218.52     69.6953    171.57     3.2114


Figure 1: Timing results with sample standard deviations for hash functions
that have been proven to be k-wise independent for small values of k.